Rationale and Research Questions

My interest in maritime industries led me to explore data from Global Fishing Watch, an international non-profit organization that provides open source data on global fishing activity. Through its data portal I discovered a data set that tracks the locations of two longline fishing vessels throughout their fishing excursions; each point is marked with whether the vessel was determined to be fishing or not fishing. This data set sparked my curiosity about what else was happening at each point and whether other variables showed patterns with fishing activity.

Because my data set consisted of two separate vessels on fishing excursions in two different regions of the world, I framed my investigation around three extents: the full data set, the vessel 1 observations, and the vessel 2 observations. This approach led to the following three research questions:

  1. What is the optimal set of variables that predict fishing activity across the full extent of the data set?
  2. Which variable(s) have the strongest explanatory power for predicting fishing activity across vessel 1 observations?
  3. Which variable(s) have the strongest explanatory power for predicting fishing activity across vessel 2 observations?

Dataset Information

The data used in my analysis are descriptive vessel tracking information from Global Fishing Watch that I supplemented with net primary productivity data from a SESYNC shiny app. Both components are briefly described below and fully described in my project documentation.

Data from Global Fishing Watch (GFW): Longline Vessel Tracking Data (CSV)

  • These longline data, like many data sets from Global Fishing Watch, originated from raw automatic identification system (AIS) data and were processed and released. By analyzing movement patterns, Global Fishing Watch’s neural networks transform raw AIS data into contextual information about fishing activity.
  • The longline data I obtained is a CSV file that includes locations, times and fishing activity status of two different vessels. There are additional attributes of ‘distance from shore’ and ‘vessel speed’.
  • Link to data source: https://globalfishingwatch.org/datasets-and-code/

Data from the National Socio-Environmental Synthesis Center (SESYNC): Net Primary Productivity Data (CSV)

  • SESYNC’s Marine Socio-Environmental Covariates Shiny App provides oceanographic information for latitude and longitude coordinates that are fed into the app. To supplement my longline fishing data set, I obtained net primary productivity (NPP) data for each vessel observation location. The majority of the NPP values lined up with the coordinates I provided, and a portion of the values were interpolated.
  • Net primary productivity (NPP) data are reported as average values in milligrams of carbon per meter squared per day (mg C/m2 day).
  • Link to data source: https://shiny.sesync.org/apps/msec/

Exploratory Analysis

Data Wrangling

The raw longline data had 65,499 observations and 11 variables when downloaded from Global Fishing Watch.

# Longline data
str(longline_full) 
## 'data.frame':    65499 obs. of  11 variables:
##  $ X                  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ mmsi               : num  1.26e+13 1.26e+13 1.26e+13 1.26e+13 1.26e+13 ...
##  $ timestamp          : int  1327136504 1327136605 1327136734 1327143281 1327143341 1327143411 1327146440 1327149860 1327149911 1327156390 ...
##  $ distance_from_shore: num  232994 233994 233994 233994 233996 ...
##  $ distance_from_port : num  311749 312410 312410 315417 316173 ...
##  $ speed              : num  8.2 7.3 6.8 6.9 6.1 ...
##  $ course             : num  230 238 239 252 231 ...
##  $ lat                : num  14.9 14.9 14.9 14.8 14.8 ...
##  $ lon                : num  -26.9 -26.9 -26.9 -26.9 -26.9 ...
##  $ is_fishing         : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ source             : Factor w/ 1 level "dalhousie_longliner": 1 1 1 1 1 1 1 1 1 1 ...
  1. Fishing Activity: describes fishing activity as fishing (1), not fishing (0), or unknown (-1). I examined the ‘is_fishing’ field to see how many observations were classified as ‘fishing’, ‘not fishing’, and ‘unknown’. Since I am interested in modeling a binary fishing activity status (‘fishing’ vs. ‘not fishing’), I narrowed the data to only include observations with either a ‘fishing’ or ‘not fishing’ status, leaving me with 4,189 observations.
# fishing activity counts
library(dplyr)  # provides count() and filter()
x <- count(longline_full, longline_full$is_fishing == 0)   # not fishing
y <- count(longline_full, longline_full$is_fishing == 1)   # fishing
z <- count(longline_full, longline_full$is_fishing == -1)  # unknown

print(x)  # 1,397 'not fishing' statuses (2.13% not fishing)
## # A tibble: 2 x 2
##   `longline_full$is_fishing == 0`     n
##   <lgl>                           <int>
## 1 FALSE                           64102
## 2 TRUE                             1397
print(y)  # 2,792 'fishing' statuses (4.26% fishing)
## # A tibble: 2 x 2
##   `longline_full$is_fishing == 1`     n
##   <lgl>                           <int>
## 1 FALSE                           62707
## 2 TRUE                             2792
print(z)  # 61,310 'no data' statuses (93.6% unknown -- eliminate these)
## # A tibble: 2 x 2
##   `longline_full$is_fishing == -1`     n
##   <lgl>                            <int>
## 1 FALSE                             4189
## 2 TRUE                             61310
# narrow data to only include instances of 'fishing' and 'not fishing' 
longline_fishing <- filter(longline_full, is_fishing %in% c(0, 1))
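The counts and percentages above can be cross-checked in a single pass. The following is a minimal sketch in base R, assuming only the three counts reported in the output; the `is_fishing` vector is rebuilt from those counts rather than read from the real file.

```r
# Rebuild the status column from the reported counts (no real rows are used).
is_fishing <- rep(c(0, 1, -1), times = c(1397, 2792, 61310))

counts <- table(is_fishing)                   # one count per status code
shares <- round(100 * prop.table(counts), 2)  # percent of all observations
print(counts)
print(shares)

# Keep only the labelled rows, mirroring the filter() call above.
labelled <- is_fishing[is_fishing %in% c(0, 1)]
length(labelled)
```

Tabulating the field once avoids three separate `count()` calls and makes it easy to confirm that the retained rows sum to the 4,189 observations used later.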
  2. MMSI: unique identifier for individual fishing vessels. To determine the number of vessels included in the data set, I counted the unique field values and then renamed each value to be more distinguishable. I later separated the data into two data frames for exploration, one for each vessel.
## [1] 1.263956e+13 5.139444e+13
## [1] Vessel 1 Vessel 2
## Levels: Vessel 1 Vessel 2
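The relabeling and per-vessel split could look roughly like the sketch below. The code that produced the output above is not shown in this report, so this is an assumption about one reasonable approach; the `mmsi` values are short placeholders, not the real identifiers.

```r
# Placeholder identifiers; the real MMSI values are not reproduced here.
mmsi <- c(111, 111, 222, 111, 222)

# Relabel in order of first appearance, so the first vessel seen becomes "Vessel 1".
vessel <- factor(mmsi, levels = unique(mmsi), labels = c("Vessel 1", "Vessel 2"))
levels(vessel)

# One data frame per vessel for separate exploration.
obs <- data.frame(ID = seq_along(mmsi), MMSI = vessel)
by_vessel <- split(obs, obs$MMSI)
nrow(by_vessel[["Vessel 1"]])
```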
  3. NPP: net primary productivity data (downloaded from SESYNC). To supplement the information available for each vessel location, I acquired net primary productivity (NPP) data from SESYNC and joined it to my processed longline data. At the end of processing, I had selected the variables I wanted to test for explanatory power in my models and converted them to appropriate formats for analysis.
## 'data.frame':    4189 obs. of  10 variables:
##  $ ID                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MMSI               : Factor w/ 2 levels "Vessel 1","Vessel 2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Date               : Date, format: "2012-06-02" "2012-06-19" ...
##  $ Time               : Factor w/ 4049 levels "00:00:38","00:01:02",..: 2240 719 570 290 419 2165 2437 532 415 688 ...
##  $ Latitude           : num  18.6 18.8 19.3 18.9 19.1 ...
##  $ Longitude          : num  -17.2 -19.5 -17.3 -17.3 -17.1 ...
##  $ Vessel_Speed       : num  8.2 5 0.7 4.2 7 ...
##  $ Distance_From_Shore: num  111123 329079 86831 98881 74248 ...
##  $ NPP_Mean           : Factor w/ 914 levels "261.947729292956",..: 896 830 902 893 912 851 844 711 698 663 ...
##  $ Fishing_Activity   : int  1 1 1 1 1 1 1 0 0 1 ...
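The join and type conversions described in this step can be sketched as follows. This is an illustration under stated assumptions, not the processing code used for the report: the join key (`ID`), the NPP values, and the toy data frames are hypothetical, while the timestamps are the first three values from the raw `str()` output.

```r
# Toy stand-ins for the processed longline rows and the SESYNC export.
longline <- data.frame(
  ID = 1:3,
  timestamp = c(1327136504, 1327136605, 1327136734),  # Unix seconds
  is_fishing = c(1, 0, 1)
)
npp <- data.frame(ID = c(3, 1, 2),
                  NPP_Mean = c("300.5", "261.9", "410.2"),  # hypothetical values
                  stringsAsFactors = TRUE)                   # read in as a factor

# Left join on the observation ID (assumed key).
joined <- merge(longline, npp, by = "ID", all.x = TRUE)

# Convert each column to an analysis-ready type.
joined$Date <- as.Date(as.POSIXct(joined$timestamp, origin = "1970-01-01", tz = "UTC"))
joined$NPP_Mean <- as.numeric(as.character(joined$NPP_Mean))  # factor -> numeric via character
joined$Fishing_Activity <- factor(joined$is_fishing, levels = c(0, 1))
str(joined)
```

Converting a factor to numeric through `as.character()` matters because `as.numeric()` applied directly to a factor returns the internal level codes, not the stored values.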

Entity and Attribute information for processed data set:

Data Field | Definition | Units | Source
ID | Unique identifier for observation | NA | GFW
MMSI | Unique identifier for vessel | NA | GFW
Date | Date of observation | YYYY-MM-DD | GFW
Latitude | Latitude coordinate of observation | Decimal Degrees | GFW
Longitude | Longitude coordinate of observation | Decimal Degrees | GFW
Vessel_Speed | Speed of vessel at observed point | Knots | GFW
Distance_From_Shore | Distance vessel is observed from shore | Meters | GFW
NPP_Mean | Mean net primary productivity value | mg C/m2 day | SESYNC
Fishing_Activity | Indication of whether observed vessel is determined to be fishing (1) or not fishing (0) based on GFW algorithms | NA | GFW

Data Exploration – Mapping

Full Extent Map

Exploring the data in space revealed that the two tracked vessels were fishing in two different parts of the world.

<This map shows both vessel 1 observations (in the east Atlantic between Spain and Africa) and vessel 2 observations (in the eastern Pacific between Washington state and Alaska).>

Vessel 1 - Fishing Activity

Here the extent is narrowed to show just vessel 1 observations.

<This map shows vessel 1 observations distinguished by the presence or absence of fishing activity. Yellow signifies points where the vessel was determined to be fishing while purple signifies points where it was not.>

Vessel 1 - Predictor Variables

Here is an exploratory view of the additional data included for each observation point.

<This map shows how each variable’s range is distributed across the observation points for vessel 1. These will be further examined in the analysis to determine how each variable correlates with fishing activity. Note that latitude is not shown as a predictor here due to its redundancy with distance from shore in this instance.>

Vessel 2 - Fishing Activity

Here the extent is narrowed to show just vessel 2 observations.

<This map shows vessel 2 observations distinguished by fishing activity. Yellow signifies fishing while purple signifies not fishing.>

Vessel 2 - Predictor Variables

Here is an exploratory view of the additional data included for each observation point.

<This map shows how each variable’s range is distributed across the observation points for vessel 2. These will be further examined in the analysis to determine how each variable correlates with fishing activity. Note that longitude is not shown as a variable here due to its redundancy with distance from shore in this instance.>

Analysis

Explanatory Power of Predictors Across Full Data Extent

General Approach: My goal in this analysis was to assess the explanatory power of each variable in predicting fishing behavior. To do this I ran binomial regression models that used five data fields (latitude, longitude, vessel speed, distance from shore, and NPP value) as predictor variables to describe the binary condition of fishing vs. not fishing. I started by examining the variables as predictors for the data set as a whole, then took a closer look at how effectively the same predictors described the fishing activity for each vessel individually.
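A binomial regression of this shape can be fit with base R's `glm()`. The sketch below is a minimal illustration on simulated data, assuming the column names from the processed data set; the value ranges and the rule linking speed to fishing are invented for the example and are not results from this analysis.

```r
set.seed(1)
n <- 500
sim <- data.frame(
  Latitude = runif(n, 15, 20),
  Longitude = runif(n, -20, -16),
  Vessel_Speed = runif(n, 0, 12),
  Distance_From_Shore = runif(n, 1e4, 3e5),
  NPP_Mean = runif(n, 200, 2000)
)
# Simulated response: slower vessels are more likely to be fishing (an invented rule).
sim$Fishing_Activity <- rbinom(n, 1, plogis(2 - 0.5 * sim$Vessel_Speed))

# Binomial (logistic) regression with all five predictors.
full_model <- glm(
  Fishing_Activity ~ Latitude + Longitude + Vessel_Speed + Distance_From_Shore + NPP_Mean,
  data = sim, family = binomial
)
summary(full_model)$coefficients
```

The per-vessel models would follow the same pattern, with `data` restricted to the rows for one vessel.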

Full Extent Model Outputs: The main model was first run with all five predictor variables, resulting in about 42.525% deviance explained and an AIC value of 3077.515. An AIC step analysis was then run to determine the optimal combination of variables for predicting fishing activity across the full extent of the data set. It found that removing the NPP Mean variable left the deviance explained essentially unchanged (42.519%) while lowering the AIC slightly (to 3075.820). These outputs, as well as the relative explanatory power of each variable, are shown below.

## [1] "Percent Deviance Explained:" "42.5248845961546"           
## [3] "AIC:"                        "3077.51541233655"
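The report does not show how these two numbers were computed. One common definition of percent deviance explained for a GLM is 100 × (1 − residual deviance / null deviance), and the sketch below assumes that definition; the data are simulated and the single-predictor model is only for illustration.

```r
set.seed(2)
n <- 400
speed <- runif(n, 0, 12)
fishing <- rbinom(n, 1, plogis(2 - 0.5 * speed))  # invented relationship
model <- glm(fishing ~ speed, family = binomial)

# Share of the null (intercept-only) deviance removed by the predictors.
pct_deviance_explained <- 100 * (1 - model$deviance / model$null.deviance)
model_aic <- AIC(model)

c("Percent Deviance Explained:", pct_deviance_explained, "AIC:", model_aic)
```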

Question 1: What is the optimal set of variables that predict fishing activity across the full extent of the data set?

Latitude, Longitude, Distance from Shore and Vessel Speed

Optimal Predictor Variables for Full Extent:

## [1] "Percent Deviance Explained:" "42.5191684723478"           
## [3] "AIC:"                        "3075.82028976924"
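An AIC step analysis of this kind can be run with `step()` from base R's stats package. The sketch below uses backward elimination on simulated data; the direction of the search, the three predictors, and the coefficients used to simulate the response are all assumptions made for illustration.

```r
set.seed(3)
n <- 600
sim <- data.frame(
  Latitude = runif(n, 15, 20),
  Vessel_Speed = runif(n, 0, 12),
  NPP_Mean = runif(n, 200, 2000)  # simulated to have no effect on the response
)
sim$Fishing_Activity <- rbinom(
  n, 1, plogis(-8 + 0.6 * sim$Latitude - 0.4 * sim$Vessel_Speed)
)

full <- glm(Fishing_Activity ~ Latitude + Vessel_Speed + NPP_Mean,
            data = sim, family = binomial)

# Drop terms one at a time for as long as doing so lowers the AIC.
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)
c(full = AIC(full), reduced = AIC(reduced))
```

Because `step()` only removes a term when the AIC improves, the reduced model's AIC is never higher than the full model's; whether a given variable is dropped depends on the data.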

Discussion of Full Extent Model Results: The results obtained after removing the NPP Mean variable from the Full Extent model show that the remaining four variables explain about the same amount of deviance in a slightly more parsimonious way. This makes sense given the weak correlation between NPP Mean and likelihood of fishing activity (shown in the Full Extent Correlation Plots above). These correlation plots also show the strongest predictor variables to be latitude and longitude. The effectiveness of latitude and longitude as predictors of fishing may be skewed because the data are clustered into two small regions of the ocean, so the latitude and longitude ranges within each cluster are relatively small.

Overall, the Full Extent Model explains only about 42.5% of the deviance, which suggests these predictor variables do a limited job of predicting fishing activity across the full extent of the data set.

Explanatory Power of Predictors Across Vessel 1 Observations

Vessel 1 Model Outputs: Narrowing the extent of the analysis to only include observation points from vessel 1, I ran a binomial regression model with the same five variables (latitude, longitude, vessel speed, distance from shore, and mean NPP value). An AIC step analysis retained all five variables as the most parsimonious set for predicting fishing activity. The percent deviance explained by this model was 56.039% and the AIC value was 186.923.

## [1] "Percent Deviance Explained:" "56.0390650778037"           
## [3] "AIC:"                        "186.922949225996"

Question 2: Which variable had the strongest explanatory power for fishing activity among vessel 1 observations?

The variable with the strongest explanatory power in predicting fishing activity off the coast of Africa was the mean NPP value. The positive correlation between fishing activity (left) and mean NPP in mg C/m2 day (right) is shown visually in the map below.

Explanatory Power of Predictor Variables Across Vessel 2 Observations

Vessel 2 Model Outputs:

## [1] "Percent Deviance Explained:" "87.7264359453294"           
## [3] "AIC:"                        "617.588698347268"

Question 3: Which variable had the strongest explanatory power for fishing activity among vessel 2 observations?

The variable with the strongest explanatory power in predicting fishing activity off the coast of Alaska was latitude. The positive correlation between fishing activity (left) and latitude (right) is shown visually in the map below. Note that the vessel's travel path to its fishing grounds happens to line up with increasing latitude, so increasing latitude should not be interpreted as a good predictor of fishing activity in all instances.

General Discussion

Comparability of Vessel 1 and Vessel 2 Models

It is important to note that due to differences in sample size and variable value ranges between vessel 1 and vessel 2 data, the percent deviance explained and AIC values are not comparable between the two models. Because the models were run on fishing activity in different parts of the globe, there were substantial differences in the ranges of latitude, longitude, and mean NPP value. This suggests that each model is probably better suited to predicting fishing activity in areas that have similar ranges in these variables. This is not an indication of poor model strength but rather reflects the natural differences in the variables and conditions that favor fishing activity across different ocean ecosystems. Additionally, sample size differed drastically between vessels 1 and 2 as well as between instances of fishing and not fishing. For stronger conclusions to be made, it would be preferable to have larger and more balanced sample sizes across both the vessels and the instances of fishing activity.
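The effect of sample size on AIC can be seen in a small simulation. In the sketch below, the same invented relationship is fit to two samples that differ only in size; the helper `fit_on()` and all values are hypothetical and are not drawn from the vessel data.

```r
set.seed(4)

# Fit the same logistic model to a freshly simulated sample of size n.
fit_on <- function(n) {
  speed <- runif(n, 0, 12)
  fishing <- rbinom(n, 1, plogis(2 - 0.5 * speed))
  glm(fishing ~ speed, family = binomial)
}

small <- fit_on(200)
large <- fit_on(2000)

# AIC grows with the number of observations even though the process is identical.
c(small = AIC(small), large = AIC(large))
```

AIC is built from the log-likelihood, which sums over observations, so it is only meaningful for ranking models fit to the same data.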

Further Analysis

This exploratory analysis provided insight into the strength of five variables in predicting fishing activity in two different marine regions. It would be interesting to extend the analysis by testing the power of the models to predict fishing activity for points where the fishing status is unknown. Ideally this would be done with the addition of good quality oceanographic variables collected at time intervals that match the granularity of the Global Fishing Watch observations. Based on my analysis, I wouldn’t expect most sets of regional variables to be good predictors across different marine ecosystems, particularly ones at different latitudes, but I think that with good quality inputs a model similar to this one could do a fair job of predicting fishing activity on a regional scale.
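Scoring points with unknown status could be done with `predict()` on a fitted model. The following is a minimal sketch on simulated data, assuming a single predictor and a 0.5 probability cutoff; both choices are illustrative, and this step was not carried out in the analysis above.

```r
set.seed(5)
n <- 300
known <- data.frame(Vessel_Speed = runif(n, 0, 12))
known$Fishing_Activity <- rbinom(n, 1, plogis(2 - 0.5 * known$Vessel_Speed))  # invented rule

model <- glm(Fishing_Activity ~ Vessel_Speed, data = known, family = binomial)

# Hypothetical observations whose fishing status is unknown.
unknown <- data.frame(Vessel_Speed = c(0.5, 6, 11))
unknown$p_fishing <- predict(model, newdata = unknown, type = "response")  # probabilities
unknown$predicted <- as.integer(unknown$p_fishing > 0.5)  # 0.5 cutoff is arbitrary
unknown
```

Because the unknown points have no labels, predictive skill would still need to be judged on held-out labelled observations.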